Ch5 Monte Carlo Methods

  • Previous: Evaluation by Dynamic Programming
    • Counting all possible situations
  • Computing Evaluation by Sampling (Monte Carlo), sketched in code after this list

  • Behavior Policy for visiting all situations
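
A minimal sketch of the "evaluation by sampling" idea, assuming a tabular value function and a hypothetical `generate_episode` callable that rolls out one episode under the policy being evaluated and returns a list of `(state, reward)` pairs:

```python
from collections import defaultdict

def mc_policy_evaluation(generate_episode, num_episodes=1000, gamma=1.0):
    """First-visit Monte Carlo prediction: estimate V(s) by averaging
    sampled returns instead of enumerating all situations as in DP."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()   # hypothetical: [(S_0, R_1), (S_1, R_2), ...]
        # record the index of the first visit to each state
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        # compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            if first_visit[state] == t:          # first-visit MC
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

Unlike DP, nothing here requires the transition probabilities: the averages of sampled returns converge to $v_\pi$ provided each state keeps being visited, which is why a behavior policy that visits all situations matters.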

Ch6 Temporal Difference Learning

As usual, we start by focusing on the policy evaluation or prediction problem, that of estimating the value function vπ for a given policy π. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem.

  • Both TD and Monte Carlo methods use experience to solve the prediction problem.

  • $V(S_t)$ Update in Monte Carlo

    • $V(S_t) \leftarrow V(S_t) + \alpha [ G_t - V(S_t)]$
  • $V(S_t)$ Update in TD
    • $V(S_t) \leftarrow V(S_t) + \alpha [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
    • Use $R_{t+1} + \gamma V(S_{t+1})$ as an estimate of $G_t$ (both updates are sketched in code below)
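
A minimal sketch of the two updates above, assuming a tabular value function `V` stored as a dict and a constant step size `alpha`; the state names in the usage lines are hypothetical:

```python
from collections import defaultdict

def mc_update(V, state, G, alpha=0.1):
    """Constant-alpha Monte Carlo update: move V(S_t) toward the full
    sampled return G_t, which is only available once the episode ends."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, gamma=1.0, alpha=0.1):
    """TD(0) update: use R_{t+1} + gamma * V(S_{t+1}) as a bootstrapped
    estimate of G_t, so the update can be made after a single step."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])

V = defaultdict(float)
td0_update(V, state="A", reward=1.0, next_state="B")   # one-step (online) update
mc_update(V, state="A", G=2.5)                          # end-of-episode update
```

The only difference is the target: Monte Carlo waits for the actual return $G_t$, while TD(0) bootstraps from the current estimate $V(S_{t+1})$ and can therefore update after every step.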
